Skip to content

Detect orphaned KG certs via public key modulus comparison#6020

Open
Robbie-Microsoft wants to merge 8 commits into
mainfrom
rginsburg/mtls_restart_orphan_check
Open

Detect orphaned KG certs via public key modulus comparison#6020
Robbie-Microsoft wants to merge 8 commits into
mainfrom
rginsburg/mtls_restart_orphan_check

Conversation

@Robbie-Microsoft
Copy link
Copy Markdown
Contributor

@Robbie-Microsoft Robbie-Microsoft commented May 20, 2026

Problem

After a VM or app restart, KeyGuard regenerates the per-boot private key in the same CNG container while old binding certs remain on disk under CN=managedidentitysnissuer.login.microsoft.com. The certs appear valid (time-wise, HasPrivateKey=true) but the key material has changed (or, in the zombie-VBS case, the container metadata survives but the private material is dead). MSAL loads such certs from the persistent store, attempts an mTLS handshake, and fails.

This PR addresses three distinct failure modes:

Mode Symptom Root cause
1. Container empty CngKey.Open throws / GetRSAPrivateKey() returns null VBS reaped per-boot key; CreateFresh hasn't run yet in this process
2. Different key in container Public key on cert doesn't match container CreateFresh ran (with OverwriteExistingKey); cert references previous key material
3. Zombie VBS Public metadata matches cert but signing fails Container metadata survives reboot; private material is dead

Fix — three complementary layers

1. Per-Read modulus comparison (WindowsPersistentCertificateCache)

Compare the cert's embedded public key (modulus + exponent) against the public key currently in the CNG container. A mismatch conclusively means the container was regenerated — the cert is orphaned. Runs inside WindowsPersistentCertificateCache.Read during candidate selection. Orphaned certs are removed from the store on detection so they don't accumulate and don't require the modulus check on every subsequent Read.

Covers: Mode 2 (and non-reboot drift scenarios).

2. CanSign liveness probe (WindowsCngKeyOperations.TryGetOrCreateKeyGuard)

A 1-byte RSA-SHA256 PKCS1 sign right after CngKey.Open. ~1–3 ms, runs once per process (the result is cached in WindowsManagedIdentityKeyProvider._cachedKey). If the probe fails, the container is treated as dead and CreateFresh is invoked.

Covers: Mode 3. Without this, ExportParameters(false) returns the old modulus from intact metadata, the per-Read modulus check passes, and the failure surfaces only at the mTLS handshake.

3. Issuer-CN sweep on fresh-key mint (PurgeManagedIdentityCertificates)

One-shot substring sweep of CurrentUser\My for CN=managedidentitysnissuer.login.microsoft.com, invoked at the moment a fresh KeyGuard key is minted (both the probe-failed path and the Open-threw path). Removes orphan binding certs at the cause site.

Covers: Modes 1 and 2 at the cause site — multi-identity hosts (SAMI + UAMIs sharing the KeyGuard container) are cleaned up uniformly, no per-Read discovery cost.

The reactive SChannel catch in ImdsV2ManagedIdentitySource is retained as a defensive backstop.

Design notes

  • Non-CNG keys (software CSP) are accepted on faith — KG regeneration does not apply
  • GetRSAPrivateKey() returning null for an RSA cert is treated as orphaned (inaccessible = unusable); non-RSA certs (ECDSA, etc.) are accepted on faith
  • CryptographicException during key load is treated as orphaned
  • The CNG modified-time heuristic (Check 3 from Dragos's proposal) is intentionally omitted — the modulus comparison is definitive and covers the same case without the false-negative window
  • The internal TryRead overload accepts an injectable isOrphaned delegate for testability; IPersistentCertificateCache interface is unchanged
  • RemoveByThumbprints is wrapped in InterprocessLock.TryWithAliasLock (300 ms best-effort) — same pattern as Write/Delete/DeleteAllForAlias
  • Cert-store snapshotting consolidated into a single SnapshotCertificates helper

Testing

Unit tests (Windows-only where RSACng is required):

PersistentCertificateStoreUnitTests.cs — 6 tests:

  • IsCertKeyOrphaned_ReturnsTrue_For_NullCert
  • IsCertKeyOrphaned_ReturnsFalse_For_ValidCert
  • IsCertKeyOrphaned_ReturnsTrue_For_RsaCert_WithNoPrivateKey
  • PublicKeyMatchesCert_ReturnsTrue_When_KeyMatchesCert
  • PublicKeyMatchesCert_ReturnsFalse_When_ModulusMismatch (different RSACng simulates post-reboot key replacement)
  • Read_RemovesOrphanedCert_FromStore (verifies cert is deleted from store and Read returns false)

WindowsCngKeyOperationsPurgeUnitTests.cs — 4 tests for the purge filter:

  • Matching issuer CN
  • Non-matching issuer CN
  • Case-insensitive matching
  • Only removes matching certs

Full unit suite: 2068 passed, 0 failed, 19 skipped (net8.0).

E2E validation (Gladwin, Server 2022 KeyGuard VM, multiple reboots, mixed SAMI/UAMI):

  • CngKey.Open threw CryptographicException HR=0x8009003A
  • Fresh KeyGuard key created
  • PurgeManagedIdentityCertificates removed orphan cert (Inspected=4)
  • MAA attestation OK → POST /issuecredential → 200
  • mTLS handshake → 200 on first try (no reactive catch invoked)
  • Total ~2.8s on cold start

Credits

Modulus-comparison layer authored by @Robbie-Microsoft. CanSign probe + issuer-CN sweep authored by @gladjohn (cherry-picked from #6037). Approach discussed with @bgavrilMS, @Raghavendra-ms, @dragosav.

Companion to #6017 (token_not_after OID eviction — different failure mode). Refs #6031.

After a VM or app restart, KeyGuard regenerates the per-boot private key in
the same CNG container while the old cert remains on disk. The cert appears
valid (time-wise, HasPrivateKey=true) but the key material has changed.

Fix: compare the cert's embedded public key modulus against the public key
currently exported from the CNG container. A mismatch conclusively means the
container was regenerated and the cert is orphaned. The check runs inside
WindowsPersistentCertificateCache.Read during candidate selection, so the
loop continues looking for a valid cert rather than returning an unusable one.

Non-CNG keys (software CSP) are accepted on faith since KeyGuard-style
regeneration does not apply to them. CryptographicException during key load
is treated as orphaned (key inaccessible = unusable).

The CNG container modified-time heuristic (Check 3 from the original proposal)
is intentionally omitted. Both it and the modulus comparison detect the same
event, but the modulus check is definitive: independently generated RSA keys
sharing a modulus is computationally infeasible. The modified-time check is a
heuristic with a known false-negative window and adds no additional coverage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 20, 2026 22:27
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a Windows KeyGuard-specific “orphaned cert” detection to MSAL’s persistent mTLS certificate selection logic by validating that the certificate’s embedded RSA public key matches the current RSA public key in the associated CNG container. This prevents MSAL from reusing a persisted cert whose key container was regenerated after reboot, avoiding repeated SCHANNEL handshake failures.

Changes:

  • Add modulus/exponent comparison helpers (IsCertKeyOrphaned, PublicKeyMatchesCert) to detect CNG-container/key mismatches.
  • Update WindowsPersistentCertificateCache.Read to skip orphaned candidates and continue searching for a usable cert.
  • Add unit tests covering the new helper methods.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
tests/Microsoft.Identity.Test.Unit/ManagedIdentityTests/PersistentCertificateStoreUnitTests.cs Adds tests for orphan detection and RSA public key matching logic.
src/client/Microsoft.Identity.Client/ManagedIdentity/V2/WindowsPersistentCertificateCache.cs Skips persisted cert candidates whose CNG container key no longer matches the cert’s embedded public key.
src/client/Microsoft.Identity.Client/ManagedIdentity/V2/MtlsCertificateCache.cs Introduces orphan-detection helpers based on RSA modulus/exponent comparison with logging on crypto failures.
Comments suppressed due to low confidence (1)

tests/Microsoft.Identity.Test.Unit/ManagedIdentityTests/PersistentCertificateStoreUnitTests.cs:912

  • Same issue as the KeyMatches test: on net48, RSA.Create(2048) commonly returns RSACryptoServiceProvider, so key2 as RSACng will be null and the test will fail. Create key2 as an RSACng explicitly (or gate the test on key2 is RSACng) to keep the multi-targeted test suite passing.
            using var key1 = RSA.Create(2048);
            using var key2 = RSA.Create(2048);
            var rsaCng2 = key2 as RSACng;
            Assert.IsNotNull(rsaCng2, "Expected RSACng on Windows.");

RSA.Create(2048) does not return RSACng on ARM64 Windows (.NET 8).
Use new RSACng(2048) explicitly so the cast is always valid.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Robbie-Microsoft Robbie-Microsoft marked this pull request as ready for review May 20, 2026 22:43
@Robbie-Microsoft Robbie-Microsoft requested a review from a team as a code owner May 20, 2026 22:43
Copilot AI review requested due to automatic review settings May 20, 2026 22:43
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Robbie-Microsoft and others added 2 commits May 20, 2026 18:57
When IsCertKeyOrphaned detects a cert whose CNG container key no longer
matches the cert's public key (post-reboot KeyGuard regeneration), the
cert is now removed from the Windows cert store instead of just skipped.

Previously the orphaned cert stayed in the store indefinitely, causing
a redundant modulus check on every Read call.

Changes:
- WindowsPersistentCertificateCache.Read: collect orphaned cert thumbprints
  during the scan loop; after the loop, open ReadWrite store and remove them
  (best-effort, wrapped in try/catch)
- Added internal Read overload accepting Func<X509Certificate2, ILoggerAdapter, bool>
  isOrphaned delegate for testability; IPersistentCertificateCache interface unchanged
- Added unit test: Read_RemovesOrphanedCert_FromStore

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GetRSAPrivateKey() returns null for both non-RSA certs (ECDSA, DSA) and
for RSA certs with an inaccessible private key. Previously both fell into
the 'accept on faith' branch, which could allow a useless RSA cert to pass
the orphan check.

Fix: check GetRSAPublicKey() to distinguish RSA certs from non-RSA certs.
An RSA cert with no accessible private key is treated as orphaned.
Non-RSA certs continue to be accepted on faith (KG regeneration does not apply).

Add unit test: IsCertKeyOrphaned_ReturnsTrue_For_RsaCert_WithNoPrivateKey

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 20, 2026 23:09
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/client/Microsoft.Identity.Client/ManagedIdentity/V2/WindowsPersistentCertificateCache.cs:190

  • RemoveByThumbprints uses List<string> + thumbprints.Contains(...) inside a loop over all certificates, making removal O(storeSize * orphanCount). Since this can run on the read path after a reboot, consider using a HashSet<string> (e.g., OrdinalIgnoreCase) for thumbprints (and/or X509Certificate2Collection.Find by thumbprint) to keep the removal cost predictable.
                int removed = 0;
                foreach (var cert in items)
                {
                    try
                    {
                        if (thumbprints.Contains(cert.Thumbprint))
                        {
                            store.Remove(cert);
                            removed++;

…ock orphan removal

- Rename internal Read overload to public TryRead with out param last
  (public Read delegates to TryRead with the default IsCertKeyOrphaned check)
- Extract SnapshotCertificates(store, logger) helper to eliminate 5 duplicate
  try/catch CopyTo fallback blocks across Read, Write, DeleteAllForAlias,
  PruneExpiredForAlias, and RemoveByThumbprints
- Wrap RemoveByThumbprints in InterprocessLock.TryWithAliasLock (300ms,
  best-effort) — same pattern as Write/Delete/DeleteAllForAlias to prevent
  concurrent store writes during orphan cleanup
- Update orphan removal test to call TryRead with the new signature

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@gladjohn gladjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two suggestions to address before merge so #6031 closes here in one PR:

  1. Add a CanSign liveness probe right after CngKey.Open in WindowsCngKeyOperations.TryGetOrCreateKeyGuard - 1-byte RSA-SHA256 sign. Catches the zombie-VBS case where the container's public metadata survives a reboot but the private material is dead. In that state ExportParameters(false) still returns the old modulus, so IsCertKeyOrphaned returns false and we'd fall through to a failed mTLS handshake.

  2. Add a Broad issuer-CN sweep of CurrentUser\My when a fresh KG key is minted - when the container was just regenerated, every cert under CN=managedidentitysnissuer.login.microsoft.com is orphaned by definition. One-shot cleanup at the cause site, no per-Read discovery cost, handles multi-identity hosts uniformly. Your per-Read removal still covers non-reboot drift.

I validated both E2E on a Server 2022 KeyGuard VM, and it works great.

@gladjohn
Copy link
Copy Markdown
Contributor

For reference, I pushed both additions as a draft in #6037 with the full implementation and unit test, feel free to cherry-pick from there or I can re-spin them as commits on this branch, whichever is easier.

…TLS certs on reboot

Fixes the post-reboot recovery path for IMDSv2 mTLS PoP token acquisition.
On Azure VM restart the per-boot KeyGuard key (NCryptUsePerBootKeyFlag) is
reaped by VBS, but the persisted binding cert under
CN=managedidentitysnissuer.login.microsoft.com still references the old
public key. The next call then either burns a failed TLS handshake before
the reactive SChannel catch kicks in, or — in the zombie-handle variant —
falls through entirely because the cert's modulus still matches the dead
container.

Changes
-------
- Add CanSign liveness probe right after CngKey.Open in
  WindowsCngKeyOperations.TryGetOrCreateKeyGuard. 1-byte RSA-SHA256 PKCS1
  sign; ~1-3ms, runs once per process (result is cached in
  WindowsManagedIdentityKeyProvider._cachedKey). Catches zombie-VBS state
  where Open succeeds but private material is dead.

- Add PurgeManagedIdentityCertificates: one-shot issuer-CN substring sweep
  of CurrentUser\My, invoked at the moment a fresh KeyGuard key is minted
  (both the probe-failed path and the Open-threw path). Removes orphaned
  binding certs at the cause site so the next request doesn't pay any
  per-Read discovery cost and multi-identity hosts (SAMI + UAMIs sharing
  the KeyGuard container) are cleaned up uniformly.

- Add 4 Windows-only unit tests for the purge filter behavior (matching,
  non-matching, case-insensitive, only-removes-matching).

The reactive SChannel catch in ImdsV2ManagedIdentitySource is retained as
a defensive backstop.

Validation
----------
Validated E2E on a Server 2022 KeyGuard VM across multiple reboots and
mixed SAMI/UAMI cases. Canonical post-reboot first call:
  - CngKey.Open threw CryptographicException HR=0x8009003A
  - Fresh KeyGuard key created
  - PurgeManagedIdentityCertificates removed orphan cert (Inspected=4)
  - MAA attestation OK
  - POST /issuecredential -> 200
  - mTLS handshake -> 200 on first try (no reactive catch invoked)
  - Total ~2.8s on cold start

Full unit suite green on net8.0: 2069 passed, 0 failed, 19 skipped.

Refs #6031.
Complementary to #6020 (cert-side modulus comparison): this PR adds the
key-side liveness probe and broad issuer-CN sweep at the mint site.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 27, 2026 22:05
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

@Robbie-Microsoft
Copy link
Copy Markdown
Contributor Author

For reference, I pushed both additions as a draft in #6037 with the full implementation and unit test, feel free to cherry-pick from there or I can re-spin them as commits on this branch, whichever is easier.

cherry-picked 1144df9 into this PR as 8b0b9a8. Your CanSign probe covers the zombie-VBS case my modulus check misses, and the issuer-CN sweep handles multi-identity hosts cleanly

- RemoveByThumbprints: use HashSet<string>(OrdinalIgnoreCase) for thumbprint lookup (defensive; .NET's Thumbprint getter always returns uppercase, but normalize anyway).
- PurgeManagedIdentityCertificates: skip null entries in the snapshot consuming loop (defensive against TOCTOU between Certificates.Count and enumeration).
- WindowsPersistentCertificateCache.TryRead (testability overload): throw ArgumentNullException when the injected isOrphaned delegate is null.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 27, 2026 22:50
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment on lines +439 to +457
// Snapshot to avoid 'collection modified during enumeration' provider quirks.
var snapshot = new X509Certificate2[store.Certificates.Count];
try
{
store.Certificates.CopyTo(snapshot, 0);
}
catch (Exception copyEx)
{
logger?.Info(() =>
$"[MI][WinKeyProvider] PurgeManagedIdentityCertificates: store snapshot via CopyTo failed " +
$"({copyEx.GetType().Name}: {copyEx.Message}). Falling back to enumeration.");

int i = 0;
snapshot = new X509Certificate2[store.Certificates.Count];
foreach (X509Certificate2 c in store.Certificates)
{
snapshot[i++] = c;
}
}
@gladjohn gladjohn self-requested a review May 27, 2026 23:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants